This is a pre-crawled dataset, taken as a subset of a bigger dataset that was created by extracting data from Naukri.com, a leading job board.
Sadly, it is not documented how the data were subsetted, so it is unclear whether analysing them would give a brief overview of the job market over a period of time or of a certain industry.
require(tidyverse)
require(data.table)
require(lubridate)
require(stringr)
require(ggvis)
require(ggplot2)
require(forcats)
require(ggmap)
require(highcharter)
require(broom)
require(plotly)
# setwd(dir="D:\\Users\\gdrtmh\\Desktop\\Kaggle\\Jobs on Naukri")
inputdt <- as.data.table(read.csv(file="naukri_com-job_sample.csv", strip.white = TRUE,
na.strings=c("","NA","Not Mentioned","Not Disclosed by Recruiter"),
stringsAsFactors = FALSE))
# setting "Not mentioned" as NA for experience etc.
# setting "Not Disclosed by Recruiter" as NA, which is the norm for missing payrate
This section cleans and organises the input data, while also investigating its basic quality and distribution.
glimpse(inputdt)
## Observations: 22,000
## Variables: 14
## $ company <chr> "MM Media Pvt Ltd", "find live infotech", ...
## $ education <chr> "UG: B.Tech/B.E. - Any Specialization PG:A...
## $ experience <chr> "0 - 1 yrs", "0 - 0 yrs", "4 - 8 yrs", "11...
## $ industry <chr> "Media / Entertainment / Internet", "Adver...
## $ jobdescription <chr> "Job Description  Send me Jobs like this...
## $ jobid <dbl> 210516002263, 210516002391, 101016900534, ...
## $ joblocation_address <chr> "Chennai", "Chennai", "Bengaluru", "Mumbai...
## $ jobtitle <chr> "Walkin Data Entry Operator (night Shift)"...
## $ numberofpositions <int> NA, 60, NA, NA, 4, NA, 2, 20, 2, NA, NA, N...
## $ payrate <chr> "1,50,000 - 2,25,000 P.A", "1,50,000 - 2,5...
## $ postdate <chr> "2016-05-21 19:30:00 +0000", "2016-05-21 1...
## $ site_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ skills <chr> "ITES", "Marketing", "IT Software - Applic...
## $ uniq_id <chr> "43b19632647068535437c774b6ca6cf8", "d4c72...
summary(as.factor(inputdt$site_name))
## www.naukri.com NA's
## 3987 18013
inputdt$site_name <- NULL
Well, it seems the site_name column only contains naukri.com or NA. It can probably be removed without influencing the subsequent analysis.
uniqueN(inputdt$uniq_id)
## [1] 22000
This would be useful as a unique identifier for individual job listings.
uniqueN(inputdt, by=c("jobid"))
## [1] 21910
It is not exactly clear how “jobid” is structured; perhaps it accounts for relisting of jobs?
However, since the number of unique jobid values differs from the number of uniq_id values by only 90, if the purpose were to capture relisting of jobs, it probably did not do that very well.
# Seeing that payrate listings correspond to the format "1,50,000 - 2,50,000 P.A",
# I use gsub to remove the strings after " P.A"
inputdt$payrate.M<-gsub( " P.A.*$", "", inputdt$payrate)
# this removes the characters after the "P.A" string,
# including certain incentive/commission information listed in payrate
# stripping all the non numerical chr from the payrate
inputdt$payrate.M<-gsub("[^0-9-]", "", inputdt$payrate.M)
# inputdt$m.payrate<-gsub("[^0-9,-,-]", "", inputdt$m.payrate)
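To sanity-check the chain above, the two substitutions can be traced on the example format from the comment; a quick standalone check in base R:

```r
# Tracing the two gsub steps above on the documented example format
p <- "1,50,000 - 2,25,000 P.A"
p <- gsub(" P.A.*$", "", p)         # drop " P.A" and everything after it
cleaned <- gsub("[^0-9-]", "", p)   # keep only digits and the range dash
cleaned                             # "150000-225000", ready to split on "-"
```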
# Outliers
# Examples of payrates that were not extracted properly (with 2 or more "-" left after the regex treatment);
# seeing there is no easy solution and only ~200 of them, they will be ignored for now
inputdt[lengths(strsplit(inputdt$payrate.M, "-"))>2, list(payrate,payrate.M),]
## # A tibble: 197 x 2
## payrate
## <chr>
## 1 Education- BDS, Be/b.tech [bio-chemistry, Bio-technology, Biomedical], Bams
## 2 PICK YOUR Job-tick YOUR Career-we Click IT FOR YOU! Yogam IS YOUR Career Co
## 3 Consolidated Pay : Rs.40,000-44,000-52,000- 56,000-60,000-64,000/- p.m. plu
## 4 Monthly Salary: Pay Scale of Rs.15600- 39100/- + Grade Pay Rs.6600/- + NPA.
## 5 Salary: RS.(52,000-5,000 - 77,000)
## 6 Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m.For direct recru
## 7 Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m.
## 8 Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m.For direct recru
## 9 Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m. For direct recr
## 10 Pay Scale: PB-3 (Rs.I5600-39100) with AGP of Rs.8000/- p.m.
## # ... with 187 more rows, and 1 more variables: payrate.M <chr>
Could potentially improve the regular expression by extracting only the numbers between the dashes, but that would require more regex work; I am putting it off for now.
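One possible shape of that refinement, sketched on one of the malformed strings from the table above: pull out every digit group directly rather than stripping characters, so stray text between the numbers no longer matters. This is only a sketch of the idea, not applied to the dataset.

```r
# Extract every digit group (allowing Indian-style thousand separators),
# then convert; demonstrated on a malformed payrate from the table above
bad  <- "Consolidated Pay : Rs.40,000-44,000-52,000- 56,000-60,000-64,000/- p.m."
nums <- regmatches(bad, gregexpr("[0-9][0-9,]*", bad))[[1]]
nums <- as.numeric(gsub(",", "", nums))  # drop the comma separators
range(nums)  # lower and upper bound of the listed scale
```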
Splitting the payrate column while ignoring rows that have more than one dash (“-”) symbol.
# Identifying the appropriate number of splits for payrate with str "-"
splits <- max(lengths(strsplit(inputdt$payrate.M, "-")))
# Splitting the payrate into m.payrate1 and m.payrate2, ignoring rows with additional splits
mdt <- inputdt[lengths(strsplit(inputdt$payrate.M, "-"))<=2,
paste0("m.payrate", 1:2) := tstrsplit(payrate.M, "-", fixed=TRUE)][]
# Changing the column into numerics
mdt$m.payrate1<- as.numeric(mdt$m.payrate1)
mdt$m.payrate2<- as.numeric(mdt$m.payrate2)
# account for case where there is only one listed payrate instead of a range
mdt<-mdt[is.na(m.payrate2), m.payrate2:=m.payrate1]
# Creating a column with the mean of the payrate range; easier for plotting etc.
# (listings with a single payrate value were handled above by copying m.payrate1 into m.payrate2)
mdt[, m.payrate.Mean := rowMeans(.SD), by = uniq_id, .SDcols = c("m.payrate1", "m.payrate2")]
The mean and the median of the payrate are equivalent here, since both are computed from just the listed lower and upper bounds of the payrate window.
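A quick check of that claim: for a two-number range, the mean and the median are the same value.

```r
# For any pair of numbers, mean and median coincide
lo <- 150000; hi <- 225000
mean(c(lo, hi)) == median(c(lo, hi))
```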
# Checking if the values in m.payrate1 are always lower than m.payrate2,
# such that m.payrate1 can be considered the lower limit of the range, and m.payrate2 the upper limit
# summary(mdt$m.payrate1>mdt$m.payrate2)
# Mode FALSE TRUE NA's
# logical 2 4743 17255
# listing the data with m.payrate1 > m.payrate2
mdt[m.payrate1>m.payrate2,list(payrate,payrate.M,m.payrate1,m.payrate2)]
## # A tibble: 2 x 4
## payrate payrate.M m.payrate1
## <chr> <chr> <dbl>
## 1 Max.package15.5 Lpa... 25-30% hike on CTC 15525-30 15525
## 2 55,00,000 - 1 Cr & above P.A 5500000-1 5500000
## # ... with 1 more variables: m.payrate2 <dbl>
In both of these cases the payrate was badly formatted and the extraction failed. There are probably further bad extractions not shown here, hidden among the rows where m.payrate2 > m.payrate1.
Removing these cases (where m.payrate1 > m.payrate2) by setting them to NA.
mdt[(m.payrate1>m.payrate2),`:=`(payrate.M= NA,
m.payrate1= NA,
m.payrate2= NA,
m.payrate.Mean= NA)]
# The majority of the records have no payrate listed.
# summary(mdt$m.payrate.Mean)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 9.000e+00 2.500e+05 3.750e+05 3.672e+09 8.000e+05 1.740e+13 17255
# The maximum figure also points to likely errors in extraction
# after trying a few cut thresholds, I find 9e+6 most suitable; below is a table of payrates above this threshold.
mdt[m.payrate.Mean>9e+6] %>%
.[,list(payrate, m.payrate.Mean)]
## # A tibble: 359 x 2
## payrate
## <chr>
## 1 Salary structure is: 32,000 INR (for thesis submitted), 36,000 INR (PhD), 3
## 2 10000 to 45000
## 3 Pay Scale: INR Rs.37400-67000 + Grade Pay Rs.10000/-
## 4 Pay : CDA Pay Scale : 9300 - 34800 + GP 4800 Gross pay including allowances
## 5 Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
## 6 Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
## 7 Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
## 8 Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.10,500/p.m.
## 9 Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
## 10 Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
## # ... with 349 more rows, and 1 more variables: m.payrate.Mean <dbl>
In fact, it almost seems like the postings with payrate = “Pay Band: PB3 (Rs.15600-39100) with Academic Grade Pay of Rs.8,000/p.m.” are just repostings, seeing that about 80 job listings have identical payrate strings.
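Counting exact duplicates of the payrate string is a cheap proxy for spotting such repostings; a toy sketch of the idea (the real check would run on mdt$payrate):

```r
# Tabulate identical payrate strings; a high count hints at repostings
payrates <- c(rep("Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m.", 3),
              "1,50,000 - 2,25,000 P.A")
tab <- sort(table(payrates), decreasing = TRUE)
tab[1]  # the most repeated string and its count
```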
# The easy way out: set all extracted payrate fields to NA for these badly extracted records, i.e.:
mdt[m.payrate.Mean>9e+6,`:=`(payrate.M= NA,
m.payrate1= NA,
m.payrate2= NA,
m.payrate.Mean= NA)]
#total count of valid payrate records
N_valid_payrate<-sum(!is.na(mdt$m.payrate.Mean))
payrate_record_percentage <- sum(!is.na(mdt$m.payrate.Mean))/nrow(mdt)
Sadly, only payrate_record_percentage of the job listings have a payrate listed and extracted so far; a mere N_valid_payrate records.
# all_values <- function(x) {
# if(is.null(x)) return(NULL)
# paste0(names(x), ": ", format(x), collapse = "<br />")
# }
mdt[!is.na(m.payrate.Mean),][m.payrate.Mean<9e+6] %>%
melt(., measure.vars = patterns("^m.payrate")) %>%
ggvis(~value, fill = ~variable) %>%
group_by(variable) %>%
layer_densities() %>%
add_axis("x", title = "Payrate Per Annum") %>%
add_axis("y", title = "Density")
# Oddly, I couldn't find a documented (and clean) approach to adding a plot title with ggvis. I will just leave it as it is for now.
It would be interesting to see how the payrate varies per industry/skill/location in a later section. For now, narrowing down the window:
mdt[!is.na(m.payrate.Mean),][m.payrate.Mean<2e+6] %>%
melt(., measure.vars = patterns("^m.payrate")) %>%
ggvis(~value, fill = ~variable) %>%
group_by(variable) %>%
layer_densities() %>%
add_axis("x", title = "Payrate Per Annum") %>%
add_axis("y", title = "Density")
Well, it seems that for the listed jobs the median salary is about ~300k rupees P.A. (the job board is based in India).
head(mdt$education,5)
## [1] "UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, Post Graduation Not Required"
## [2] "UG: B.Tech/B.E. - Any Specialization PG:MBA/PGDM - Any Specialization"
## [3] "UG: Any Graduate - Any Specialization PG:Any Postgraduate Doctorate:Doctorate Not Required"
## [4] "UG: Any Graduate - Any Specialization PG:CA Doctorate:Doctorate Not Required"
## [5] "UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, MCA - Computers, M.Tech - Any Specialization Doctorate:Any Doctorate - Any Specialization"
It can be seen (after skimming through multiple lines) that education can be split into its fields quite accurately with the delimiters “UG:”, “PG:” and “Doctorate:”. Also, in some cases an education field is listed yet noted as “Not Required”.
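A minimal base-R sketch of that split on the first record shown above; the chunk below does the real work with stringr::str_extract using the same lookaround patterns.

```r
# Lookbehind/lookahead split of an education string (perl = TRUE regex)
edu <- "UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, Post Graduation Not Required"
ug <- trimws(regmatches(edu, regexpr("(?<=UG: ).*(?=PG:)", edu, perl = TRUE)))
pg <- trimws(regmatches(edu, regexpr("(?<=PG:).*$", edu, perl = TRUE)))
ug  # "B.Tech/B.E. - Any Specialization"
pg  # "Any Postgraduate - Any Specialization, Post Graduation Not Required"
```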
mdt %>%
.[grep("Diploma",education),m.Education.tf.Diploma := TRUE] %>% #Diploma
.[,m.Education.UG :=trimws(str_extract(education, "(?<=UG: ).*(?=PG:)|(?<=UG: ).*$"))] %>%
.[grep("Not Required",m.Education.UG),m.Education.UG :=NA] %>% #Removing "not required"
.[!is.na(m.Education.UG),m.Education.tf.UG:=TRUE] %>%
.[,m.Education.PG :=trimws(str_extract(education, "(?<=PG:).*(?=Doctorate:)|(?<=PG:).*$"))] %>%
.[grep("MBA",m.Education.PG),m.Education.tf.MBA := TRUE] %>% #MBA
.[grep("Not Required",m.Education.PG),m.Education.PG :=NA] %>%
.[!is.na(m.Education.PG),m.Education.tf.PG:=TRUE] %>%
.[,m.Education.DR :=trimws(str_extract(education, "(?<=Doctorate:).*$"))] %>%
.[grep("Not Required",m.Education.DR),m.Education.DR :=NA] %>%
.[!is.na(m.Education.DR),m.Education.tf.DR:=TRUE]
Note that I also created “m.Education.tf.x” columns (where x = Diploma, UG, MBA, PG, DR) to indicate whether the designated education level is required, as TRUE or NA.
In some cases, such as
mdt[(33:35)]$m.Education.UG
## [1] "Any Graduate - Any Specialization, B.Sc - Any Specialization, Maths, Physics, B.Com - Commerce"
## [2] "Any Graduate - Any Specialization"
## [3] "B.Tech/B.E. - Bio-Chemistry/Bio-Technology, Biomedical, Chemical, B.Pharma - Pharmacy"
An additional split on “,” would allow more detail to be extracted, but I will leave this alone for now. Going further would probably require diving into NLP or pulling a dictionary (to account for short forms and synonyms).
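For reference, the extra split would look like this on the first string above. Note that it cannot tell a degree (“B.Sc”) apart from a specialization of the preceding degree (“Maths”, “Physics”), which is part of why it is deferred.

```r
# Naive comma split of a UG education field
ug.field <- "Any Graduate - Any Specialization, B.Sc - Any Specialization, Maths, Physics, B.Com - Commerce"
parts <- trimws(strsplit(ug.field, ",")[[1]])
parts
```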
mdt %>%
.[,lapply(.SD, sum, na.rm=TRUE) ,.SDcols= (names(mdt) %like% "^m.Education.tf*")]
## # A tibble: 1 x 5
## m.Education.tf.Diploma m.Education.tf.UG m.Education.tf.MBA
## <int> <int> <int>
## 1 669 17071 3182
## # ... with 2 more variables: m.Education.tf.PG <int>,
## # m.Education.tf.DR <int>
Note that the table above allows multiple entries per job across the different education categories. For example, a job that lists both a UG: and a PG: requirement is counted once in each category.
The following code creates a column that records only the highest education level for each individual job listing; i.e. if a job lists “UG: B.Sc, PG: Masters”, it is recorded as a Masters-level job. I assume the levels of education follow a nested/linear path: Diploma &lt; UG &lt; MBA &lt; PG &lt; DR. The position of MBA is perhaps slightly controversial.
# This data.table chain uses grep (string detection) and is.na() on the previously extracted education columns to assign a single factor level, with higher levels overwriting lower ones
mdt %>%
.[grep("Diploma",education),m.Education.factor := "Diploma"] %>%
.[!is.na(m.Education.UG),m.Education.factor := "UG"] %>%
.[grep("MBA",education),m.Education.factor := "MBA"] %>%
.[!is.na(m.Education.PG),m.Education.factor := "PG"] %>%
.[!is.na(m.Education.DR),m.Education.factor := "DR"]
mdt$m.Education.factor <- factor(mdt$m.Education.factor)
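Since the levels are assumed to be linearly ordered, an ordered factor would encode the Diploma &lt; UG &lt; MBA &lt; PG &lt; DR hierarchy explicitly and allow direct comparisons; a small sketch of that alternative:

```r
# An ordered factor makes the education hierarchy comparable with < and >
lvl <- factor(c("UG", "DR", "MBA"),
              levels = c("Diploma", "UG", "MBA", "PG", "DR"), ordered = TRUE)
lvl > "UG"  # FALSE TRUE TRUE
```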
# There ought to be a better method to melt/extract the highest-education-requirement information, but I failed to extract it cleanly from the previously extracted columns.
#head of records in skill
head(mdt$skills)
## [1] "ITES"
## [2] "Marketing"
## [3] "IT Software - Application Programming"
## [4] "Accounts"
## [5] "IT Software - Application Programming"
## [6] "IT Software - Application Programming"
# unique(inputdt$skills)
Listing the top 10 skills in demand:
mdt[!is.na(skills),.N,by=list(skills)] %>%
.[,.SD[order(-N)]] %>%
.[, head(.SD, 10),] %>%
.[,skills:=fct_inorder(factor(skills))] %>% # fct_inorder from "forcats" to keep the descending order on the x axis
ggvis(x=~skills, y=~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", properties = axis_props(
labels = list(angle = 45, align = "left", fontSize = 10)
))
This shows that the most in-demand skill is IT Software - Application Programming, followed by Sales and ITES.
mdt$skills <- as.factor(mdt$skills)
# Checking whether every entry has a valid industry listed,
# summary(!is.na(mdt$industry))
# Mode FALSE TRUE NA's
# logical 5 21995 0
# The majority of the job listings listed a valid industry
head(mdt$industry)
## [1] "Media / Entertainment / Internet"
## [2] "Advertising / PR / MR / Event Management"
## [3] "IT-Software / Software Services"
## [4] "Banking / Financial Services / Broking"
## [5] "IT-Software / Software Services"
## [6] "IT-Software / Software Services"
# since there is a limited number of industry combinations, this column should be set as a factor
mdt$industry <- as.factor(mdt$industry)
It seems most industry values are well arranged, almost in a tree-like structure.
# # splitting the mdt$industry with "/" into industryS1, industryS2 etc
# splits <- max(lengths(strsplit(mdt$industry, "/")))
# mdt <- mdt[,paste0("industryS", 1:splits) := tstrsplit(industry, " / ", fixed=TRUE)][]
#
#
# mdt[, industry_combined := paste(sort(.SD), collapse = " / "), by = 1:nrow(mdt),
# .SDcols = names(mdt) %like% "^industryS"]
# head(mdt$industry_combined)
#
# mdt[, .N, by= industry_combined] %>%
# .[,.SD[order(-N)]]
mdt[, .N, by= industry] %>%
.[,.SD[order(-N)]]
## # A tibble: 64 x 2
## industry N
## <fctr> <int>
## 1 IT-Software / Software Services 9216
## 2 Education / Teaching / Training 1322
## 3 BPO / Call Centre / ITES 1254
## 4 Banking / Financial Services / Broking 1238
## 5 Recruitment / Staffing 1129
## 6 Internet / Ecommerce 738
## 7 Pharma / Biotech / Clinical Research 525
## 8 Medical / Healthcare / Hospitals 495
## 9 Automobile / Auto Anciliary / Auto Components 478
## 10 Construction / Engineering / Cement / Metals 449
## # ... with 54 more rows
## note that this would also create duplicate counts for job listings with multiple industries, since some listings are filed under several industries
# mdt.melt.industry <- melt(mdt, measure.vars = patterns("^industryS"))
#
# mdt.melt.industry$value <- trimws(mdt.melt.industry$value)
#
# mdt.melt.industry %>%
# .[!is.na(value),.N,by=list(value)]%>%
# .[,.SD[order(-N)]] %>%
# .[, head(.SD, 10),] %>%
# .[,value:=fct_inorder(factor(value))] %>% # fct_infreq from "forcats" to reoder the factor in x axis
# ggvis(x=~value, y=~N) %>%
# layer_bars(fillOpacity := 0.1)%>%
# add_axis("x", title = "Industry",properties = axis_props(
# labels = list(angle = 45, align = "left", fontSize = 10)
# )) %>%
# add_axis("y", title = "Number of posting")
mdt[, .N, by= industry] %>%
.[,.SD[order(-N)]] %>%
.[, head(.SD, 10),] %>%
.[,industry:=fct_inorder(factor(industry))] %>% # fct_inorder from "forcats" to keep the descending order on the x axis
ggvis(x=~industry, y=~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", title = "Industry",properties = axis_props(
labels = list(angle = 45, align = "left", fontSize = 10)
)) %>%
add_axis("y", title = "Number of posting")
# mdt.melt.industry[!is.na(value),.N,by=list(value)] %>%
# .[,.SD[order(-N)]] %>%
# .[, tail(.SD, 10),]%>%
# .[,value:=fct_inorder(factor(value))] %>% # fct_infreq from "forcats" to reoder the factor in x axis
# ggvis(x=~as.factor(value), y=~N) %>%
# layer_bars(fillOpacity := 0.1)%>%
# add_axis("x", title = "Industry") %>%
# add_axis("y", title = "Number of posting")
mdt[, .N, by= industry] %>%
.[,.SD[order(-N)]] %>%
.[!is.na(industry), tail(.SD, 10),] %>%
.[,industry:=fct_inorder(factor(industry))] %>% # fct_inorder from "forcats" to keep the descending order on the x axis
ggvis(x=~industry, y=~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", title = "Industry",properties = axis_props(
labels = list(angle = 45, align = "left", fontSize = 10)
)) %>%
add_axis("y", title = "Number of posting")
# dt[!str_detect(dt$experience, "yrs")]# It seems all the listing are in a matter of years
# one entry with double dash "-"
# glimpse(dt[max(lengths(strsplit(dt$experience, "-")))==lengths(strsplit(dt$experience, "-"))])
mdt[max(lengths(strsplit(mdt$experience, "-")))==lengths(strsplit(mdt$experience, "-"))]$experience <- "1 - 3 yrs"
# Removing the trailing " yrs" (last 4 characters) from the experience records
# nchar(mdt$experience, allowNA = TRUE)-3
mdt$experience.M <- str_sub(mdt$experience, 1,nchar(mdt$experience, allowNA = TRUE)-4)
# Splitting the experience
splits <- max(lengths(strsplit(mdt$experience.M, "-")))
mdt <- mdt[,paste0("experience", 1:splits) := tstrsplit(experience.M, "-", fixed=TRUE)][]
# Changing them to appropriate class
mdt$experience1<-as.numeric(mdt$experience1)
mdt$experience2<-as.numeric(mdt$experience2)
# Create a column for the median of experience
mdt[, m.experience.Median:=median(c(experience1, experience2)), by = uniq_id]
# Checking if the values in experience1 are always lower than experience2,
# such that experience1 can be considered as a lower limit for the job, vice versa for experience2
# summary(mdt$experience1<=mdt$experience2)
# Mode TRUE NA's
# logical 21885 115
# renaming the experience1 and experience2 column into more obvious form
setnames(mdt, c("experience1","experience2"), c("m.experience.L", "m.experience.U"))
mdt[!is.na(m.experience.L),]%>%
melt(., measure.vars = patterns("^m.experience")) %>%
ggvis(~value, fill = ~variable) %>%
group_by(variable) %>%
layer_densities(adjust = 2) %>% #adjust parameter to smooth the wiggliness of density
add_axis("x", title = "Number of years") %>%
add_axis("y", title = "Density")
mdt<-mdt[, m.experience.Cut := cut(m.experience.Median,
breaks=c(-Inf, 3*(0:3), 10+4*(0:2), Inf), right=F)]
# "Not listed" must be added as a factor level before the NA medians can be labelled with it
levels(mdt$m.experience.Cut) <- c(levels(mdt$m.experience.Cut), "Not listed")
mdt[is.na(m.experience.Cut), m.experience.Cut := "Not listed"]
# By default, values outside the break range become NA; the breaks start at -Inf so that every median is counted
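The break vector expands to -Inf, 0, 3, 6, 9, 10, 14, 18, Inf; a quick standalone check that the open ends catch values outside the data range (right = FALSE makes each interval closed on the left):

```r
# Every value falls into one of the 8 intervals, including extremes
x <- cut(c(-1, 0, 5, 25), breaks = c(-Inf, 3*(0:3), 10 + 4*(0:2), Inf), right = FALSE)
levels(x)
```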
class(mdt$numberofpositions)
## [1] "integer"
summary(mdt$numberofpositions)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 2.00 4.00 45.12 10.00 2000.00 17536
Interestingly, there are job postings listed with 2000 positions available; a quick look at this outlier:
mdt[numberofpositions==2000]
## # A tibble: 2 x 31
## company
## <chr>
## 1 HCL Technologies Limited
## 2 HCL Technologies Limited
## # ... with 30 more variables: education <chr>, experience <chr>,
## # industry <fctr>, jobdescription <chr>, jobid <dbl>,
## # joblocation_address <chr>, jobtitle <chr>, numberofpositions <int>,
## # payrate <chr>, postdate <chr>, skills <fctr>, uniq_id <chr>,
## # payrate.M <chr>, m.payrate1 <dbl>, m.payrate2 <dbl>,
## # m.payrate.Mean <dbl>, m.Education.tf.Diploma <lgl>,
## # m.Education.UG <chr>, m.Education.tf.UG <lgl>, m.Education.PG <chr>,
## # m.Education.tf.MBA <lgl>, m.Education.tf.PG <lgl>,
## # m.Education.DR <chr>, m.Education.tf.DR <lgl>,
## # m.Education.factor <fctr>, experience.M <chr>, m.experience.L <dbl>,
## # m.experience.U <dbl>, m.experience.Median <dbl>,
## # m.experience.Cut <fctr>
The majority of the job listings don’t state a number of positions. I think it is reasonable to assume those without a listed ‘numberofpositions’ offer 1. This may not be entirely accurate, but it will ease manipulation with other data in later sections.
mdt<-mdt%>%
.[, m.Npositions:=numberofpositions] %>%
.[is.na(numberofpositions), m.Npositions:=1]
summary(mdt$m.Npositions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 9.952 1.000 2000.000
Now all recorded number of positions have a valid number.
Splitting postdate into time and time zone
head(mdt$postdate)
## [1] "2016-05-21 19:30:00 +0000" "2016-05-21 19:30:00 +0000"
## [3] "2016-10-13 16:20:55 +0000" "2016-10-13 16:20:55 +0000"
## [5] "2016-10-13 16:20:55 +0000" "2016-10-13 16:20:55 +0000"
# splitting postdate (format "2016-05-21 19:30:00 +0000") on the "+"
mdt[, c("postdate_time","postdate_timezone") := tstrsplit(postdate, "+", fixed=TRUE)]
# summary(as.factor(mdt$postdate_timezone))
# 0000 NA's
# 21977 23
# it seems the timezone is pretty meaningless in this dataset too
# Removing timezone column
mdt$postdate_timezone <- NULL
Formatting
mdt$postdate_time <- parse_date_time(mdt$postdate_time, "Ymd HMS")
#ignoring timezone for now, shouldn't make any difference too
min(mdt$postdate_time, na.rm =T)
## [1] "2015-01-24 19:48:31 UTC"
max(mdt$postdate_time, na.rm =T)
## [1] "2017-01-11 21:00:00 UTC"
#
mdt[, m.timeCut := cut(as.POSIXct(postdate_time), breaks="1 month")]
mdt[!is.na(m.timeCut), .N, by=m.timeCut] %>%
ggvis(~m.timeCut, ~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", properties = axis_props(
labels = list(angle = 45, align = "left", fontSize = 10)
))
# alternative data table method
# mdt[, .N, by=month(postdate_time)]
# alternate dplyr solution
# mdt$uniq_id
# mdt %>%
# group_by(yr = year(postdate_time), mon = month(postdate_time)) %>%
# summarise(mn_amt = length(uniq_id))
If we assume this captures all job listings from 2015 to 2017 (probably unlikely, given the huge variance over the period and the months with little to no listings), it seems most jobs were posted in March 2016 (start of year-ish), followed by Nov 2015 (end of year). I wonder if this coincides with the timing of graduations/bonuses.
Finally, we are at location. I have to admit that the reason I went ahead with ggvis is predominantly because of its map package.
head(mdt$joblocation_address,5)
## [1] "Chennai"
## [2] "Chennai"
## [3] "Bengaluru"
## [4] "Mumbai, Bengaluru, Kolkata, Chennai, Coimbatore, Tamilnadu"
## [5] "Bengaluru"
Some job listings are listed at multiple locations, such as “Delhi NCR, Mumbai, Bengaluru, Kochi, Greater Noida, Gurgaon, Hyderabad, Kozhikode, Lucknow”. Hence, a strsplit on “,” and “/” should do the job.
# Removing str in parenthesis as this information is redundant for our purpose of data exploration
mdt$joblocation_address <- gsub("\\(.*\\)","", mdt$joblocation_address)
# splitting by ",", "/", and " "
splits <- max(lengths(strsplit(mdt$joblocation_address, ", |/")))
mdt <- mdt[,paste0("m.locationS", 1:splits) := tstrsplit(joblocation_address, ", |/")]
mdt$joblocation_address<- NULL # removing the original data
# some job locations appear to use a space " " as a delimiter...
# Obviously, splitting on " " would introduce additional issues (i.e. location names containing spaces). Thankfully this is rather rare, so I will ignore " " as a delimiter for now.
# The melt below also lets jobs with multiple location listings be counted at each location (duplicates), for example:
mdt.melt.location <- melt(mdt, measure.vars = patterns("^m.locationS"), value.name = "m.i.location")
# where m.i.location indicates modified and individual extracted location
#removing variable that just indicate the order of the "m.locationSX" in Location
mdt.melt.location$variable<- NULL
# Setting empty strings to NA
mdt.melt.location[m.i.location == "", m.i.location := NA_character_]
# Trimming white space from the value
mdt.melt.location$m.i.location <- trimws(mdt.melt.location$m.i.location)
# plotting most popular location for job listing
mdt.melt.location %>%
.[!is.na(m.i.location),.N,by=list(m.i.location)] %>%
.[,.SD[order(-N)]] %>%
.[, head(.SD, 10),] %>%
.[,m.i.location:=fct_inorder(factor(m.i.location))] %>% # fct_inorder from "forcats" to keep the descending order on the x axis
ggvis(x=~m.i.location, y=~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", title = "location",properties = axis_props(
labels = list(angle = 45, align = "left", fontSize = 10)
)) %>%
add_axis("y", title = "Number of posting")
Note that this is not exhaustive, since job listings can carry slightly variable location names that are registered as separate counts; e.g. “India-Karnataka-Bangalore” is listed as a separate instance from “Bangalore”.
Regardless, it seems Bengaluru, Bangalore and Mumbai are among the most popular job locations.
## In this section, I use geocode to query Google for the lat and lon of the respective locations.
# this will be done only with the top 100 locations by popularity,
# which incidentally also only includes locations with more than ~10 job listings.
mdt.melt.location.count.t100 <- mdt.melt.location %>%
.[!is.na(m.i.location),.N,by=list(m.i.location)] %>%
.[,.SD[order(-N)]] %>%
.[, head(.SD, 100),]
# making a list of file in current working directory
filelist <- list.files()
# if mydat.rds exists, read it; otherwise use geocode to find lat/lon for the locations
if(any(filelist=="mydat.rds")){
#read the created .rds containing the require data
mydat <- readRDS("mydat.rds")
}else{
# using geocode (part of the ggmap package) to find the lat and lon for the top 100 locations
# perhaps not the cleanest way; some of the locations will not be geocoded accurately.
geocodes <- geocode(as.character(mdt.melt.location.count.t100$m.i.location))
# might be better to batch-feed it to avoid the over-query limit,
# which can be an issue even on a network with a good connection
mydat <- data.frame(mdt.melt.location.count.t100[,1:2],geocodes)
saveRDS(mydat, "mydat.rds")
}
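A sketch of the batch-feeding idea mentioned in the comment above: split the queries into chunks with a pause between them. `f` stands in for geocode(); the helper itself makes no network calls, and the chunk size and pause in the usage line are illustrative assumptions.

```r
# Apply f to x in chunks of `size`, pausing `pause` seconds between chunks
batched <- function(x, f, size = 10, pause = 0) {
  idx <- split(seq_along(x), ceiling(seq_along(x) / size))
  do.call(rbind, lapply(idx, function(i) { Sys.sleep(pause); f(x[i]) }))
}
# e.g. batched(as.character(mdt.melt.location.count.t100$m.i.location),
#              geocode, size = 10, pause = 2)
```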
# # The locations that have absurd lon and lat are to be removed
# mydat.filtered <- setDT(mydat)[lon>60&lon<90] %>%
# .[lat>5&lat<40]
# India longitudes and latitude from google
india = c(lon =78.96, lat = 20.59)
# # loading the mapdata
# india.map = get_map(location = india, zoom = 5)
#
# # plotting with ggmap countour plot
# ggmap(india.map, extent = "panel", maprange=FALSE)+
# geom_density2d(data = mydat.filtered, aes(x = lon, y = lat)) +
# scale_fill_gradient(low = "green", high = "red") +
# scale_alpha(range = c(0.00, 0.25), guide = FALSE) +
# theme(legend.position = "none", axis.title = element_blank(), text = element_text(size = 12))
Via a join with the map data
mapdata <- get_data_from_map(download_map_data("countries/in/in-all"))
# glimpse(mapdata)
# Observations: 34
# Variables: 20
# $ hc-group <chr> "admin1", "admin1", "admin1", "admin1", "admin1", "admin1", "admin1", "adm...
# $ hc-middle-x <dbl> 0.65, 0.59, 0.50, 0.56, 0.46, 0.46, 0.51, 0.59, 0.47, 0.56, 0.60, 0.63, 0....
# $ hc-middle-y <dbl> 0.81, 0.63, 0.74, 0.38, 0.64, 0.51, 0.34, 0.41, 0.60, 0.32, 0.47, 0.55, 0....
# $ hc-key <chr> "in-py", "in-ld", "in-wb", "in-or", "in-br", "in-sk", "in-ct", "in-tn", "i...
# $ hc-a2 <chr> "PY", "LD", "WB", "OR", "BR", "SK", "CT", "TN", "MP", "GU", "GA", "NL", "M...
# $ labelrank <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",...
# $ hasc <chr> "IN.PY", "IN.LD", "IN.WB", "IN.OR", "IN.BR", "IN.SK", "IN.CT", "IN.", "IN....
# $ alt-name <chr> "Pondicherry|Puduchcheri|Pondichéry", "Ã\u008dles Laquedives|Laccadive|Mi...
# $ woe-id <chr> "20070459", "2345748", "2345761", "2345755", "2345742", "2345762", "200704...
# $ fips <chr> "IN22", "IN14", "IN28", "IN21", "IN34", "IN29", "IN37", "IN22", "IN35", "I...
# $ postal-code <chr> "PY", "LD", "WB", "OR", "BR", "SK", "CT", "TN", "MP", NA, "GA", "NL", "MN"...
# $ name <chr> "Puducherry", "Lakshadweep", "West Bengal", "Orissa", "Bihar", "Sikkim", "...
# $ country <chr> "India", "India", "India", "India", "India", "India", "India", "India", "I...
# $ type-en <chr> "Union Territory", "Union Territory", "State", "State", "State", "State", ...
# $ region <chr> "South", "South", "East", "East", "East", "East", "Central", "South", "Cen...
# $ longitude <chr> "79.7758", "72.7811", "87.7289", "84.4341", "85.8134", "88.4482", "82.3069...
# $ woe-name <chr> "Puducherry", "Lakshadweep", "West Bengal", "Orissa", "Bihar", "Sikkim", "...
# $ latitude <chr> "10.9224", "11.2249", "23.0523", "20.625", "25.6853", "27.5709", "21.8044"...
# $ woe-label <chr> "Puducherry, IN, India", "Lakshadweep, IN, India", "West Bengal, IN, India...
# $ type <chr> "Union Territor", "Union Territor", "State", "State", "State", "State", "S...
DT1 <- setDT(mydat)[, c("lon", "lat"), with = FALSE] %>%
.[lon>60&lon<90] %>%
.[lat>5&lat<30] %>%
unique(., by = c('lon','lat'))
DT2 <- setDT(mapdata)[, c("longitude", "latitude", "hc-a2"), with = FALSE] %>%
.[, c("longitude","latitude"):=lapply(.SD, as.numeric), .SDcol=c("longitude","latitude")]
colnames(DT2) <- c("lon","lat", "hca2")
# checking the uniqueness
# DT1[,.N,c("lon","lat")][N>1L]
# DT2[,.N,c("longitude","latitude")][N>1L]
# Turns out joining on the nearest value based on two columns isn't trivial;
# I have adapted a function from Stack Overflow:
# https://stackoverflow.com/questions/28435126/merge-data-table-by-two-nearest-variables?noredirect=1&lq=1
func = function(u,v)
{
vec = with(DT2, (u-lon)^2 + (v-lat)^2)
DT2[which.min(vec),]$hca2
}
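A sanity check of the nearest-match idea on toy coordinates (approximate Bengaluru and Mumbai lon/lat, hard-coded here as an assumption): each query point should map to the closest row of the lookup table.

```r
# Minimal standalone version of func() above, against a two-row lookup table
lookup <- data.frame(lon = c(77.59, 72.88), lat = c(12.97, 19.08),
                     id = c("KA", "MH"), stringsAsFactors = FALSE)
nearest <- function(u, v) lookup$id[which.min((u - lookup$lon)^2 + (v - lookup$lat)^2)]
nearest(77, 13)    # closer to the Bengaluru row -> "KA"
nearest(72.9, 19)  # closer to the Mumbai row -> "MH"
```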
t.join.data <- transform(DT1, 'hc-a2'=apply(DT1, 1, function(u) func(u[[1]], u[[2]]))) %>%
.[mydat, on=.(lon,lat)] %>%
.[!is.na('hc-a2'), value:=sum(N), by="hc-a2"] %>%
unique(., by = c('hc-a2')) %>%
.[lon>60&lon<90] %>%
.[lat>5&lat<30]
# hcmap("countries/in/in-all", data = t.join.data, value = "value",
# joinBy = c("hc-a2", "hc-a2"), name = "Number of job listing",
# dataLabels = list(enabled = TRUE, format = '{point.name}'), borderWidth = 1,
# tooltip = list(valueDecimals = 0, valuePrefix = "", valueSuffix = ""))
Sadly, I don’t see an easy way to join the map data together with the job listing information to provide state/region-level information; I guess the likes of Mumbai, Bengaluru, Bangalore etc. are actually city names.
I resort to plotting the geocoded latitude and longitude with highcharter, which depicts data similar to the ggmap in the previous section.
# Setting z,color and name parameter for plotting with highcharter
highcharter.plot <- setDT(mydat)[lon>60&lon<90] %>%
.[lat>5&lat<30]%>%
.[,`:=`(
name=m.i.location,
z=N,
color= colorize(N)
)]
# plotting with highcharter
hcmap("countries/in/in-all", name = "India") %>%
hc_add_series(data = highcharter.plot, type = "mapbubble",
name = "Job listing", maxSize = "35") %>%
hc_title(text = "Locations of Job Posting") %>%
hc_subtitle(text = "only top 100 locations are included, the counts also include duplicates if the job is listed at multiple location") %>%
hc_mapNavigation(enabled = TRUE)
Output of the job description column:
head(mdt$jobdescription, 1)
## [1] "Job Description  Send me Jobs like this Qualifications: - == > 10th To Graduation & Any Skill: - == > Basic Computer Knowledge Job Requirement : - == > System or Laptop Type of job: - == > Full Time or Part time Languages : - == > Tamil & English. Experience : - == > Freshers & Experience payment details: - 1 form per day 5/- 10 form per day 50/- 100 form per day 500/- monthly you can earn 15000/- per month Selection Process: - == > Easy Selection Process,So What Are You Waiting For? Apply Now & Grab Best Opportunity To Make Your Carrier & To Improve Your Earing Skills. More detail contact Mr Hari 8678902528 9003010282 Salary:INR 1,50,000 - 2,25,000 P.A Industry: Media / Entertainment / Internet Functional Area: ITES , BPO , KPO , LPO , Customer Service , Operations Role Category:Other Role:Fresher Keyskills English Typing Part Time Data Entry Selection Process Desired Candidate Profile Education- UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, Post Graduation Not Required Please refer to the Job description above Company Profile: MM Media Pvt Ltd Mass Media International Find Live Info Pvt Ltd DATA SERVICES is established in the year 2012 with a vision of transforming an ordinary to extra ordinary thing. Find live info is one of our main project, it is a developing search engine site based at Chennai, India. With years of experience, we have crafted massive data collection job opportunities for public. FIND LIVE INFO a blend of experience, imagination, strategy and action to create digital experience that excite, challenge, inspire and engage users. FIND LIVE INFO provides other business opportunities which will be handled by experts of various fields. We have interesting combination of young and experienced Specialists Under the guidance of our Top Management we bloom towards success More detail contect 8678902528 Download PPT Photo 1  View Contact Details"
It seems this column is not well formatted. Some of the useful information, such as payrate and industry, has also already been extracted into individual columns by the pre-crawler. Still, it would be interesting to see whether we can identify additional useful information in the job descriptions with some basic NLP processing.
It seems every job description begins with “Job Description  Send me Jobs like this”. This particular string will be removed before any processing.
## Removing the "Job Description  Send me Jobs like this"
mdt$jobdescription <- str_replace(mdt$jobdescription, "Job Description", "") %>%
str_replace(., "Send me Jobs like this", "")
# Checking the output
head(mdt$jobdescription, 1)
## [1] " Â Qualifications: - == > 10th To Graduation & Any Skill: - == > Basic Computer Knowledge Job Requirement : - == > System or Laptop Type of job: - == > Full Time or Part time Languages : - == > Tamil & English. Experience : - == > Freshers & Experience payment details: - 1 form per day 5/- 10 form per day 50/- 100 form per day 500/- monthly you can earn 15000/- per month Selection Process: - == > Easy Selection Process,So What Are You Waiting For? Apply Now & Grab Best Opportunity To Make Your Carrier & To Improve Your Earing Skills. More detail contact Mr Hari 8678902528 9003010282 Salary:INR 1,50,000 - 2,25,000 P.A Industry: Media / Entertainment / Internet Functional Area: ITES , BPO , KPO , LPO , Customer Service , Operations Role Category:Other Role:Fresher Keyskills English Typing Part Time Data Entry Selection Process Desired Candidate Profile Education- UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, Post Graduation Not Required Please refer to the Job description above Company Profile: MM Media Pvt Ltd Mass Media International Find Live Info Pvt Ltd DATA SERVICES is established in the year 2012 with a vision of transforming an ordinary to extra ordinary thing. Find live info is one of our main project, it is a developing search engine site based at Chennai, India. With years of experience, we have crafted massive data collection job opportunities for public. FIND LIVE INFO a blend of experience, imagination, strategy and action to create digital experience that excite, challenge, inspire and engage users. FIND LIVE INFO provides other business opportunities which will be handled by experts of various fields. We have interesting combination of young and experienced Specialists Under the guidance of our Top Management we bloom towards success More detail contect 8678902528 Download PPT Photo 1 Â View Contact Details"
require(tm)
require(quanteda)
require(wordcloud)
require(slam) # for row_sums
require(RColorBrewer)
### create Document Feature Matrix
dfm.txt <- corpus(mdt$jobdescription, docvars = data.frame(grp = mdt$industry)) %>%
dfm(., groups = "grp",
stem = T,
remove=stopwords("english"),
remove_symbols = T,
remove_punct = T,
remove_numbers = T,
remove_url = T)
dfm.txt
## Document-feature matrix of: 64 documents, 57,132 features (94.7% sparse).
## Data.Frame of Term frequency
dt.freq<-data.frame(Term = names(topfeatures(dfm.txt,20)),
N=topfeatures(dfm.txt,20))
## Barplot of Term frequency
setDT(dt.freq)[,Term:= fct_inorder(factor(Term))] %>%
ggvis(~Term, ~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", title = "Term",
properties = axis_props(labels = list(angle = 45, align = "left", fontSize = 10))) %>%
add_axis("y", title = "Number of Instances Mentioned")
## Set warning=FALSE in the Rmd chunk options to silence warnings about words that could not fit on the page
## Rowsum to find individual term's frequency across all documents
term_frequency<-row_sums(t(dfm.txt))
## Sort term_frequency in descending order
term_frequency<- sort(term_frequency,decreasing = T)
## Plot a barchart of the 10 most common words
barplot(term_frequency[1:10], col = "tan", las = 2)
## Create word_freqs
word_freqs <- data.frame(term = names(term_frequency), num = term_frequency)
set.seed(4211)
## Plot wordcloud
wordcloud(word_freqs$term, word_freqs$num, max.words = 100,
random.order=FALSE, colors=brewer.pal(8,"Dark2"), rot.per=.3, scale=c(3,.5))
Note that this only considers the first/primary industry of each job record.
## Summarizing and identifying the 8 most popular industries
popular_industry_count <- mdt %>%
.[!is.na(industry),.N,by=list(industry)] %>%
.[,.SD[order(-N)]] %>%
.[, head(.SD, 8),]
# popular_industry_count$industryS1
# tfidf(wikiDfm)
textplot_wordcloud(dfm_select(dfm.txt, documents=popular_industry_count$industry[1:4]),
comparison = T, rot.per = .25, scale=c(2,.55),title.size =1)
textplot_wordcloud(dfm_select(dfm.txt, documents=popular_industry_count$industry[5:8]),
comparison = T, rot.per = .25, scale=c(2,.55),title.size =1)
## Making a dictionary of common and popular programming-language strings
myDict <- dictionary(list(git="git",
c = "c",
python = "python",
java="java",
cpp = "c++",
chash= "csharp",
r="r",
javascript="javascript",
php= "php",
go="go",
swift="swift"
))
# The latter languages might be miscounted, as their names also double as
# common words (e.g. "go", "swift", and the single letters "c" and "r").
## Applying the dictionary just defined on the job description column.
ProgrammingL_popularity <- dfm(mdt$jobdescription, dictionary = myDict)
## Topfeatures() to extract popular languages and its counts
dt.ProgrammingL<-data.frame(Language = names(topfeatures(ProgrammingL_popularity)),
N=topfeatures(ProgrammingL_popularity))
## fct_inorder to keep the languages in their frequency order
setDT(dt.ProgrammingL)[,Language:= fct_inorder(factor(Language))] %>%
ggvis(~Language, ~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", title = "Programming Language",
properties = axis_props(labels = list(angle = 45, align = "left", fontSize = 15))) %>%
add_axis("y", title = "Number of Instances Mentioned")
## To extract frequency of words
# freq <- textstat_frequency(ProgrammingL_popularity)
# "textstat_frequency" is only available in quanteda 0.99, which doesn't seem to be available by default at the moment
Strangely, Java is the most popular language; it was mentioned some 5,000 times in job descriptions. It is followed by C and JavaScript, though they are only half as popular. R is also rather popular, somehow even more so than Python.
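One likely explanation for the surprising C and R counts is that single-letter names match stray tokens after punctuation removal. A quick base-R sketch with word boundaries (the example text is made up):

```r
# Counting whole-word occurrences; note how "Vitamin C" contributes to "c".
txt <- tolower("Experience in Java, core Java and J2EE. Vitamin C not required.")
count_word <- function(word, text) {
  m <- gregexpr(paste0("\\b", word, "\\b"), text)[[1]]
  if (m[1] == -1) 0L else length(m)
}
count_word("java", txt)  # 2
count_word("c", txt)     # 1, a false positive for the C language
```

The dictionary approach above tokenises first, so a stray "c" or "r" anywhere in a description still counts towards the language.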
# Keeping only the extracted/useful columns in the data table
mdt <- mdt %>%
.[,(colnames(.) %like% "^m."| colnames(.)=="industry"| colnames(.)=="uniq_id"),
with=FALSE]
# Another modified data table, with multiple locations listed as multiple entries
mdt.melt.location <- mdt.melt.location %>%
.[,(colnames(.) %like% "^m."| colnames(.)=="industry"| colnames(.)=="uniq_id"), with=FALSE]
hcboxplot(x = mdt$m.payrate.Mean,
var = mdt$industry) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Industry")
If we ignore the outliers and the raw counts (which would be heavily influenced by the number of job listings in each industry), it seems that “Iron and Steel”, “Pulp and Paper”, “IT-Software” and “Strategy/Management Consulting Firms” command the best payrates.
However, one would be mistaken to conclude straight away that “Pulp and Paper” and the likes are the best industries (pay-wise) to work in, as this ignores factors such as the number of listings with recorded payrates, as well as other unspecified requirements.
The table below lists the median pay of listings by industry.
mdt[!is.na(m.payrate.Mean),.(.N,
Q1=quantile(m.payrate.Mean, 0.25),
PayMed=median(m.payrate.Mean),
Q3=quantile(m.payrate.Mean, 0.75)), by=industry] %>%
.[,.SD[order(-PayMed)]] %>%
.[, head(.SD, 30),]
## # A tibble: 30 x 5
## industry N Q1 PayMed Q3
## <fctr> <int> <dbl> <dbl> <dbl>
## 1 Pulp and Paper 2 784375 1231250 1678125
## 2 Strategy / Management Consulting Firms 35 487500 1000000 1475000
## 3 Iron and Steel 6 362500 975000 1475000
## 4 Ceramics / Sanitary ware 13 500000 800000 850000
## 5 Tyres 1 750000 750000 750000
## 6 Glass / Glassware 5 700000 725000 1100000
## 7 Telecom/ISP 15 312500 687500 1137500
## 8 Brewery / Distillery 3 487500 625000 912500
## 9 IT-Hardware & Networking 79 293750 600000 1025000
## 10 Water Treatment / Waste Management 4 468750 587500 675000
## # ... with 20 more rows
Evidently, the “Pulp and Paper” payrate figures are derived from only two job listings, so its pay scale is likely heavily skewed; the abnormally high pay might be reserved for C-level employees. Also, we have no idea how the data (a subset of the actual listings) would scale if the full database were considered. Regardless, it is probably safe to assume the current dataset is a random subset of the full one, so industries with more listings should portray a clearer picture of pay.
Thus, let us restrict the boxplot to industries with more than 30 listings.
t.filtered.mdt <- mdt[!is.na(m.payrate.Mean), if (.N > 30) .SD, by = industry]
hcboxplot(x = t.filtered.mdt$m.payrate.Mean,
var = t.filtered.mdt$industry) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Industry")
The clear winners are “IT-Software”, “Internet” and “Strategy/Management Consulting Firms”.
In the form of table:
mdt[!is.na(m.payrate.Mean), if (.N > 30) .SD, by = industry]%>%
.[!is.na(m.payrate.Mean), .(.N,
Q1=quantile(m.payrate.Mean, 0.25),
PayMed=median(m.payrate.Mean),
Q3=quantile(m.payrate.Mean, 0.75)), by=industry] %>%
.[,.SD[order(-PayMed)]] %>%
.[, head(.SD, 10),]
## # A tibble: 10 x 5
## industry N Q1 PayMed
## <fctr> <int> <dbl> <dbl>
## 1 Strategy / Management Consulting Firms 35 487500 1000000
## 2 IT-Hardware & Networking 79 293750 600000
## 3 IT-Software / Software Services 1249 337500 550000
## 4 Textiles / Garments / Accessories 51 312500 550000
## 5 FMCG / Foods / Beverage 82 265625 500000
## 6 Banking / Financial Services / Broking 279 300000 475000
## 7 Accounting / Finance 74 265625 450000
## 8 Oil and Gas / Energy / Power / Infrastructure 34 275000 406250
## 9 Media / Entertainment / Internet 53 250000 400000
## 10 Real Estate / Property 82 300000 400000
## # ... with 1 more variables: Q3 <dbl>
mdt[!is.na(m.payrate.Mean), if (.N > 30) .SD, by = industry]%>%
.[!is.na(m.payrate.Mean), .(.N,
Q1=quantile(m.payrate.Mean, 0.25),
PayMed=median(m.payrate.Mean),
Q3=quantile(m.payrate.Mean, 0.75)), by=industry] %>%
.[,.SD[order(-PayMed)]] %>%
.[, tail(.SD, 10),]
## # A tibble: 10 x 5
## industry N Q1 PayMed
## <fctr> <int> <dbl> <dbl>
## 1 Retail / Wholesale 61 225000 350000
## 2 Courier / Transportation / Freight / Warehousing 39 168750 350000
## 3 Travel / Hotels / Restaurants / Airlines / Railways 128 212500 318750
## 4 Other 77 175000 312500
## 5 Industrial Products / Heavy Machinery 75 206250 300000
## 6 Insurance 86 250000 300000
## 7 Medical / Healthcare / Hospitals 179 250000 300000
## 8 Education / Teaching / Training 138 203125 300000
## 9 BPO / Call Centre / ITES 741 212500 275000
## 10 Recruitment / Staffing 115 162500 225000
## # ... with 1 more variables: Q3 <dbl>
At the tail end, some industries' pay is poor: both the industry median and the interquartile range (Q3 - Q1) are fairly small. This probably shows that both the starting pay and the room for growth in these industries are limited, as exemplified by “BPO/Call Centre”, “Recruitment/Staffing”, “Teaching/Training” and “Healthcare”.
It is also interesting to note that some of the low-paying industries (“Call Centre”, “Education”, “Hospital”) are among those that posted the most jobs on the database for the duration of the data subset.
hcboxplot(x = mdt$m.payrate.Mean,
var = mdt$m.experience.Median) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs experience (yrs)")
filtered.IMP <- mdt[!is.na(m.experience.Median) & !is.na(m.payrate.Mean),
.SD, .SDcols=c("m.experience.Median","m.payrate.Mean")]
tempt_ggplot <- filtered.IMP %>%
ggplot(aes(x=factor(m.experience.Median), y=m.payrate.Mean)) +
geom_boxplot() +
geom_smooth(method="lm", se=FALSE, aes(group=1)) +
geom_text(x = 30, y = 4, label = sprintf("adj_r2: %s",
signif(
glance(lm(m.experience.Median~m.payrate.Mean, filtered.IMP))$adj.r.squared,
3)), parse = TRUE)
ggplotly(tempt_ggplot) %>%
layout(title = 'Payrate vs Experience',
xaxis = list(title = 'Experience (years)[range]', tickangle = 90),
yaxis = list(title = 'Annual Payrate'))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## archieve of using ggvis for plotting
# mdt[!is.na(m.payrate.Mean)] %>%
# ggvis(~industryS1, ~m.payrate.Mean) %>%
# layer_boxplots(fillOpacity := 0.1, width = 0.5) %>%
# add_axis("x", title = "Payrate PA Distribution",
# properties = axis_props(labels = list(angle = 90, align = "left", fontSize = 10))) %>%
# add_axis("y", title = "IndustryS1")
# mdt[!is.na(m.experience.Median)] %>%
# ggvis(~industryS1, ~m.experience.Median) %>%
# layer_boxplots(fillOpacity := 0.1, width = 0.5) %>%
# add_axis("x", title = "Payrate PA Distribution",
# properties = axis_props(labels = list(angle = 90, align = "left", fontSize = 10))) %>%
# add_axis("y", title = "IndustryS1")
If we assume that Pay/Wage grows exponentially with years of experience T: \[Pay = Base \times (1 + R)^{T}\] \[\log(Pay) = \log(Base) + T\log(1 + R)\]
so log(Pay) should be a linear function of T (here approximated by m.experience.Median).
tempt_ggplot <- filtered.IMP %>%
ggplot(aes(x=factor(m.experience.Median), y=log(m.payrate.Mean))) +
geom_boxplot() +
geom_smooth(method="lm", formula=y~x, se=FALSE, aes(group=1)) +
geom_text(x = 30, y = 4, label = sprintf("adj_r2: %s",
signif(
glance(lm(m.experience.Median~log(m.payrate.Mean), filtered.IMP))$adj.r.squared,
3)), parse = TRUE)
ggplotly(tempt_ggplot) %>%
layout(title = 'log(Payrate) vs Experience',
xaxis = list(title = 'Experience (years)[range]', tickangle = 90),
yaxis = list(title = 'log(Annual Payrate)'))
While both fits (raw pay and log(pay)) have p-values near zero, the adjusted R-squared of the linear fit is higher than that of the log fit, though at ~0.52 it still leaves about half the variance unexplained. Perhaps we can narrow down to specific industries in later sections and explore this in full.
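Under the exponential assumption above, the slope of the log(pay) fit is log(1 + R), so a fitted slope can be read back as an implied annual raise, exp(slope) - 1. A quick sanity check on synthetic, noise-free data (the 10% growth rate and base pay are made-up numbers):

```r
# If pay grows 10% per year, the log-linear slope recovers log(1.10) exactly.
years <- 0:10
pay   <- 300000 * 1.10^years              # synthetic exponential pay path
slope <- coef(lm(log(pay) ~ years))[["years"]]
exp(slope) - 1                            # the implied annual raise R, ~0.10
```

On the real data the slope would mix cross-sectional effects (different cohorts, industries) rather than track one person's raises, so this reading is only indicative.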
hcboxplot(x = mdt$m.payrate.Mean,
var = mdt$m.experience.Cut,
var2 = mdt$industry,
outliers = FALSE) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Industry")
This section examines the payrate commanded by people at different levels of education.
mdt.melt.Education <- melt(mdt[!is.na(m.payrate.Mean)],
measure.vars = patterns("^m.Education.tf"),
na.rm=T)
hcboxplot(x = mdt.melt.Education$m.payrate.Mean,
var = mdt.melt.Education$m.experience.Cut,
var2 = mdt.melt.Education$variable,
outliers = FALSE) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education")
It seems the level of education for fresh graduates and people with up to 3 years of working experience does not result in a significant difference in payrate (ignoring outliers). However, the difference in their payscale may simply not appear significant next to the other groups because of the differences in magnitude (y-axis).
The next plot examines the salary differences for people with little to no experience.
# Filtering to plot only jobs that list 0 to 3 years of experience.
hcboxplot(x = mdt.melt.Education[m.experience.Cut=="[0,3)"]$m.payrate.Mean,
var = mdt.melt.Education[m.experience.Cut=="[0,3)"]$variable,
name = "Length", color = "#2980b9")
It is possible that some of the job listings that state no experience requirement are in fact posts that require a significant amount of working experience; the extracted requirement cannot capture this if it is not listed explicitly.
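For reference, here is a sketch of how a range string like "4 - 8 yrs" might be reduced to numbers. The actual upstream extraction that produced m.experience.Median is not shown in this section, so this is an assumed reconstruction, not the author's code:

```r
# Pull the numbers out of an experience-range string and summarise them.
parse_exp <- function(x) {
  nums <- as.numeric(regmatches(x, gregexpr("[0-9]+", x))[[1]])
  c(min = min(nums), max = max(nums), median = median(nums))
}
parse_exp("4 - 8 yrs")  # min 4, max 8, median 6
```

A listing stating "0 - 0 yrs" parses to zero experience even if the description elsewhere demands seasoned candidates, which is exactly the failure mode described above.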
The previous sections treated a job that listed multiple education requirements as multiple job listings, i.e. a job open to degree holders (UG), diploma holders and PG holders was melted into one entry per education level.
This, in effect, understates the actual payrate differences between education levels. In the actual market (currently an employer's market), those with advanced degrees are generally heavily favoured (not to say experience doesn't matter), while those with lower levels of education find it hard to compete.
Few employees, given a choice, would be willing to underemploy/undersell themselves for a position that requires a lower level of education or pays less.
hcboxplot(x = mdt$m.payrate.Mean,
var = mdt$m.experience.Cut,
var2 = mdt$m.Education.factor,
outliers = FALSE) %>%
hc_chart(type = "column")
# table<-mdt%>%
# .[,med.pay:=median(m.payrate.Mean), by =c("m.Education.factor", "m.experience.Cut") ]
Overall, it is quite obvious that assigning only the single highest education requirement to each job listing yields a much greater difference in pay. I think this is a more accurate reflection of the job market.
# industry_cout_table_40<-mdt[!is.na(m.payrate.Mean), count:.N, by= industry] %>%
# .[,.SD[order(-N)]] %>%
# head(., 40)
#
# industry_cout_table_40$industry <- droplevels(industry_cout_table_40$industry)
# factor(industry_cout_table_40$industry)
#
# mdt[industry_cout_table_40$industry,]
#
# (colnames(mdt) %in% industry_cout_table_40$industry)
unique(mdt$industry)
## [1] Media / Entertainment / Internet
## [2] Advertising / PR / MR / Event Management
## [3] IT-Software / Software Services
## [4] Banking / Financial Services / Broking
## [5] Aviation / Aerospace Firms
## [6] Industrial Products / Heavy Machinery
## [7] FMCG / Foods / Beverage
## [8] Recruitment / Staffing
## [9] Internet / Ecommerce
## [10] Travel / Hotels / Restaurants / Airlines / Railways
## [11] BPO / Call Centre / ITES
## [12] Pharma / Biotech / Clinical Research
## [13] Real Estate / Property
## [14] Insurance
## [15] Facility Management
## [16] Publishing
## [17] Education / Teaching / Training
## [18] Retail / Wholesale
## [19] Automobile / Auto Anciliary / Auto Components
## [20] Chemicals / PetroChemical / Plastic / Rubber
## [21] Government / Defence
## [22] Accounting / Finance
## [23] Textiles / Garments / Accessories
## [24] Semiconductors / Electronics
## [25] Telecom/ISP
## [26] Medical / Healthcare / Hospitals
## [27] Legal
## [28] Courier / Transportation / Freight / Warehousing
## [29] NGO / Social Services / Regulators / Industry Associations
## [30] Other
## [31] Architecture / Interior Design
## [32] KPO / Research / Analytics
## [33] Construction / Engineering / Cement / Metals
## [34] Office Equipment / Automation
## [35] Consumer Electronics / Appliances / Durables
## [36] Iron and Steel
## [37] Strategy / Management Consulting Firms
## [38] Oil and Gas / Energy / Power / Infrastructure
## [39] IT-Hardware & Networking
## [40] Wellness / Fitness / Sports / Beauty
## [41] Agriculture / Dairy
## [42] Electricals / Switchgears
## [43] Security / Law Enforcement
## [44] Gems / Jewellery
## [45] Ceramics / Sanitary ware
## [46] Fresher / Trainee / Entry Level
## [47] Food Processing
## [48] Printing / Packaging
## [49] Brewery / Distillery
## [50] Telecom/ISP /
## [51] Export / Import
## [52] Heat Ventilation / Air Conditioning
## [53] Wellness / Fitness / Sports
## [54] Water Treatment / Waste Management
## [55] <NA>
## [56] Shipping / Marine
## [57] Glass / Glassware
## [58] Animation / Gaming
## [59] Fertilizers / Pesticides
## [60] Pulp and Paper
## [61] Tyres
## [62] Leather
## [63] Broadcasting
## [64] Medical Devices / Equipments
## 64 Levels: Accounting / Finance ...
# limiting the plotting to industries with more than 150 listings
t.filtered.mdt <- mdt[!is.na(m.payrate.Mean), if (.N > 150) .SD, by = industry] %>%
droplevels(.)
hcboxplot(x = t.filtered.mdt$m.payrate.Mean,
var = t.filtered.mdt$m.experience.Cut,
var2 = t.filtered.mdt$industry,
outliers = FALSE) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Industry")
# limiting the plotting to industries with between 110 and 150 listings
t.filtered.mdt <- mdt[!is.na(m.payrate.Mean), if (.N > 110 & .N<150) .SD, by = industry] %>%
droplevels(.)
hcboxplot(x = t.filtered.mdt$m.payrate.Mean,
var = t.filtered.mdt$m.experience.Cut,
var2 = t.filtered.mdt$industry,
outliers = FALSE) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Industry")
# Note: summary()$coefficients[,4] returns two p-values (intercept and slope),
# which is why each industry appears twice in the output below.
mdt[!is.na(m.payrate.Mean)&!is.na(m.experience.Median)&!is.na(industry),
list(intercept=coef(lm(m.payrate.Mean~m.experience.Median))[1],
coef=coef(lm(m.payrate.Mean~m.experience.Median))[2],
p.value=summary(lm(m.payrate.Mean~m.experience.Median))$coefficients[,4] ),
by=industry]
## # A tibble: 121 x 4
## industry intercept coef
## <fctr> <dbl> <dbl>
## 1 Media / Entertainment / Internet 92674.30 97973.74
## 2 Media / Entertainment / Internet 92674.30 97973.74
## 3 Advertising / PR / MR / Event Management 6816.69 158496.92
## 4 Advertising / PR / MR / Event Management 6816.69 158496.92
## 5 IT-Software / Software Services -60679.97 164657.08
## 6 IT-Software / Software Services -60679.97 164657.08
## 7 Industrial Products / Heavy Machinery -151875.01 140574.62
## 8 Industrial Products / Heavy Machinery -151875.01 140574.62
## 9 Recruitment / Staffing 47271.63 88297.83
## 10 Recruitment / Staffing 47271.63 88297.83
## # ... with 111 more rows, and 1 more variables: p.value <dbl>
mdt[!is.na(m.payrate.Mean), .N, by=industry] %>%
.[N>10,] %>%
.[,.SD[order(-N)]]
## # A tibble: 41 x 2
## industry N
## <fctr> <int>
## 1 IT-Software / Software Services 1249
## 2 BPO / Call Centre / ITES 741
## 3 Banking / Financial Services / Broking 279
## 4 Automobile / Auto Anciliary / Auto Components 195
## 5 Medical / Healthcare / Hospitals 179
## 6 Education / Teaching / Training 138
## 7 Travel / Hotels / Restaurants / Airlines / Railways 128
## 8 Internet / Ecommerce 116
## 9 Recruitment / Staffing 115
## 10 Pharma / Biotech / Clinical Research 109
## # ... with 31 more rows
mdt[!is.na(m.payrate.Mean)&!is.na(m.experience.Median)&!is.na(industry),] %>%
group_by(industry) %>%
do(tidy(lm(m.payrate.Mean ~ m.experience.Median, data = .)))
## # A tibble: 121 x 6
## # Groups: industry [61]
## industry term estimate
## <fctr> <chr> <dbl>
## 1 Accounting / Finance (Intercept) 192261.14
## 2 Accounting / Finance m.experience.Median 92255.89
## 3 Advertising / PR / MR / Event Management (Intercept) 6816.69
## 4 Advertising / PR / MR / Event Management m.experience.Median 158496.92
## 5 Agriculture / Dairy (Intercept) -263829.02
## 6 Agriculture / Dairy m.experience.Median 145962.64
## 7 Animation / Gaming (Intercept) 35227.27
## 8 Animation / Gaming m.experience.Median 52272.73
## 9 Architecture / Interior Design (Intercept) 109488.29
## 10 Architecture / Interior Design m.experience.Median 65050.20
## # ... with 111 more rows, and 3 more variables: std.error <dbl>,
## # statistic <dbl>, p.value <dbl>
Since we have the most abundant data for the “IT-Software / Software Services” industry, let's explore it in more depth.
## Time of posting
# filtering for only data with a valid payrate and the IT software industry
mdt.melt.location %>%
.[!is.na(m.payrate.Mean)&industry=="IT-Software / Software Services",] %>%
.[!is.na(m.timeCut), .N, by=m.timeCut] %>%
ggvis(~m.timeCut, ~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", properties = axis_props(
labels = list(angle = 45, align = "left", fontSize = 10)
))
# # melting for education , this would create duplicate job posting for difference level of education
# mdt.melt.location.Education <- melt(mdt.melt.location,
# measure.vars = patterns ("^m.Education.tf."), na.rm = T, variable.name = "EducationLevel")
# mdt.melt.location.Education$value <- NULL
#
# mdt.melt.location.Education <- mdt.melt.location.Education %>%
# .[!is.na(m.payrate.Mean)|industry=="IT-Software / Software Services",]
#
# colnames(mdt.melt.location)
t.filtered.mdt<- mdt %>%
.[!is.na(m.payrate.Mean)&industry=="IT-Software / Software Services"]
# depicting only the highest educational level per job listing
hcboxplot(x = t.filtered.mdt$m.payrate.Mean,
var = t.filtered.mdt$m.experience.Cut,
var2 = t.filtered.mdt$m.Education.factor,
outliers = FALSE) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education")
## Highcharter plot to depict the location of job posting on map
highcharter.plot <- mdt.melt.location %>%
.[!is.na(m.payrate.Mean)&industry=="IT-Software / Software Services",] %>%
.[, .(.N, Paymedian=median(m.payrate.Mean),
Q1=quantile(m.payrate.Mean, 0.25),
Q3=quantile(m.payrate.Mean, 0.75)), by = m.i.location] %>%
.[order(-N)] %>%
.[!is.na(m.i.location)] %>%
.[mydat, on="m.i.location"]%>%
.[lon>60&lon<90] %>%
.[lat>5&lat<30] %>%
.[,`:=`(
name=m.i.location,
z=N,
color= colorize(Paymedian)
)]
## highcharter
hcmap("countries/in/in-all", name = "India") %>%
hc_add_series(data = highcharter.plot, type = "mapbubble",
name = "Job listing", maxSize = "25") %>%
hc_title(text = "Locations of Job Posting for IT Software") %>%
hc_subtitle(text = "only top 100 locations are included, the calculation allowed multiple entries if the job is listed at multiple location") %>%
hc_tooltip(useHTML = TRUE,
headerFormat = "<table>",
pointFormat = paste("<tr><th colspan=\"1\">N Job Listing<td>{point.z}</td></th></tr>",
"<tr><th>Location Name</th><td>{point.name} </td></tr>",
"<tr><th>Pay Median</th><td>{point.Paymedian} </td></tr>",
"<tr><th>Pay Q1</th><td>{point.Q1} </td></tr>",
"<tr><th>Pay Q3</th><td>{point.Q3} </td></tr>"),
footerFormat = "</table>")
# # Fit lm model using 5 x 5-fold CV: model
# model <- train(
# m.payrate.Mean ~ m.experience.Median + EducationLevel,
# mdt.melt.location.Education,
# method = "rpart", na.action = na.omit,
# trControl = trainControl(
# method = "cv", number = 5, verboseIter = TRUE
# )
# )
#
# # Print model to console
# model
# depicting only the highest educational level per job listing
hcboxplot(x = t.filtered.mdt$m.payrate.Mean,
var = t.filtered.mdt$m.experience.Cut,
var2 = t.filtered.mdt$m.Education.factor,
outliers = FALSE) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education")%>%
hc_tooltip(useHTML = TRUE,
headerFormat = "<table>",
pointFormat = paste("<tr><th colspan=\"1\">N Job Listing<td>{point.z}</td></th></tr>",
"<tr><th>Location Name</th><td>{point.name} </td></tr>",
"<tr><th>Pay Median</th><td>{point.Paymedian} </td></tr>",
"<tr><th>Pay Q1</th><td>{point.Q1} </td></tr>",
"<tr><th>Pay Q3</th><td>{point.Q3} </td></tr>"),
footerFormat = "</table>")
## Time of posting
# filtering for only data with a valid payrate and the healthcare industry
mdt %>%
.[!is.na(m.payrate.Mean)&industry=="Medical / Healthcare / Hospitals"] %>%
.[!is.na(m.timeCut), .N, by=m.timeCut] %>%
ggvis(~m.timeCut, ~N) %>%
layer_bars(fillOpacity := 0.1) %>%
add_axis("x", properties = axis_props(
labels = list(angle = 45, align = "left", fontSize = 10)
))
This may yield false information, as the trend could be an artefact of the data subsetting; still, it does show that most of the healthcare job listings in this subset occurred around March and May 2016.
## Highcharter plot to depict the location of job posting on map
highcharter.plot <- mdt.melt.location %>%
.[!is.na(m.payrate.Mean)&industry=="Medical / Healthcare / Hospitals",] %>%
.[, .(.N, Paymedian=median(m.payrate.Mean),
Q1=quantile(m.payrate.Mean, 0.25),
Q3=quantile(m.payrate.Mean, 0.75)), by = m.i.location] %>%
.[order(-N)] %>%
.[!is.na(m.i.location)] %>%
.[mydat, on="m.i.location"]%>%
.[lon>60&lon<90] %>%
.[lat>5&lat<30] %>%
.[,`:=`(
name=m.i.location,
z=N,
color= colorize(Paymedian)
)]
# Highcharter
hcmap("countries/in/in-all", name = "India") %>%
hc_add_series(data = highcharter.plot, type = "mapbubble",
name = "Job listing", maxSize = "25") %>%
hc_title(text = "Locations of Job Posting for Healthcare") %>%
hc_subtitle(text = "only top 100 locations are included, the calculation allowed multiple entries if the job is listed at multiple location") %>%
hc_tooltip(useHTML = TRUE,
headerFormat = "<table>",
pointFormat = paste("<tr><th colspan=\"1\">N Job Listing<td>{point.z}</td></th></tr>",
"<tr><th>Location Name</th><td>{point.name} </td></tr>",
"<tr><th>Pay Median</th><td>{point.Paymedian} </td></tr>",
"<tr><th>Pay Q1</th><td>{point.Q1} </td></tr>",
"<tr><th>Pay Q3</th><td>{point.Q3} </td></tr>"),
footerFormat = "</table>")
t.filtered.mdt<-mdt %>%
.[!is.na(m.payrate.Mean)&industry=="Medical / Healthcare / Hospitals"]
hcboxplot(x = t.filtered.mdt$m.payrate.Mean,
var = t.filtered.mdt$m.experience.Cut,
var2 = t.filtered.mdt$m.Education.factor,
outliers = FALSE) %>%
hc_chart(type = "column") %>%
hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education") %>%
hc_subtitle(text = "the extra series is allocated for job listing with no clear education requirement")
require(leaflet)
# Font Awesome icons for each industry's map markers
software.icons <- awesomeIcons(icon = 'fa-github', library = 'fa')
internet.icons <- awesomeIcons(icon = 'fa-internet-explorer', library = 'fa')
bank.icon      <- awesomeIcons(icon = 'fa-institution', library = 'fa')
factory.icon   <- awesomeIcons(icon = 'fa-industry', library = 'fa')
recruit.icon   <- awesomeIcons(icon = 'fa-handshake-o', library = 'fa')
hospital.icon  <- awesomeIcons(icon = 'fa-h-square', library = 'fa')
pharma.icon    <- awesomeIcons(icon = 'fa-medkit', library = 'fa')
phone.icon     <- awesomeIcons(icon = 'fa-phone', library = 'fa')
# join mdt.melt.location (which duplicates rows for multi-location listings)
# with the geocoded addresses (mydat)
combined <- mdt.melt.location[, if (.N > 1000) .SD, by = industry] %>%
  .[, if (.N > 200) .N, by = .(industry, m.i.location)] %>%
  .[mydat, on = "m.i.location"] %>%
  .[!is.na(industry)] %>%
  # keep coordinates within India's rough bounding box
  .[lon > 60 & lon < 90] %>%
  .[lat > 5 & lat < 30] %>%
  droplevels(.)
list_industry <- unique(combined$industry)
individual_listing <- mdt.melt.location[, .(industry, m.i.location)] %>%
  .[industry %in% list_industry] %>%
  .[, .N, by = c("industry", "m.i.location")] %>%
  .[mydat, on = "m.i.location"] %>%
  # drop rows where geocoding failed for either coordinate
  .[!is.na(lon) & !is.na(lat)] %>%
  .[lon > 60 & lon < 90] %>%
  .[lat > 5 & lat < 30] %>%
  droplevels(.)
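The `&` versus `|` distinction in the NA filter matters: a listing with only one geocoded coordinate cannot be plotted, so both `lon` and `lat` must be non-missing. A base-R sketch with hypothetical coordinates:

```r
# Hypothetical geocoding results; two rows failed on one coordinate each
coords <- data.frame(loc = c("Chennai", "Unknown", "Pune"),
                     lon = c(80.27, 77.10, NA),
                     lat = c(13.08, NA, 18.52))

# OR keeps any row with at least one coordinate present ...
or_kept  <- coords[!is.na(coords$lon) | !is.na(coords$lat), ]

# ... while AND keeps only fully geocoded rows, which is what mapping needs
and_kept <- coords[!is.na(coords$lon) & !is.na(coords$lat), ]
nrow(and_kept)
```

With OR, all three rows survive even though two of them cannot be placed on a map; with AND, only the fully geocoded Chennai row remains.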
individual_listing[industry=="IT-Software / Software Services"]
## # A tibble: 67 x 6
## industry m.i.location N i.N lon
## <fctr> <chr> <int> <int> <dbl>
## 1 IT-Software / Software Services Bengaluru 4501 9760 77.59456
## 2 IT-Software / Software Services Bangalore 3429 7335 77.59456
## 3 IT-Software / Software Services Mumbai 1939 5959 72.87766
## 4 IT-Software / Software Services Hyderabad 2145 4570 78.48667
## 5 IT-Software / Software Services Delhi 1349 4103 77.10249
## 6 IT-Software / Software Services Chennai 1284 3369 80.27072
## 7 IT-Software / Software Services Secunderabad 1440 2858 78.49827
## 8 IT-Software / Software Services Noida 1274 2063 77.39103
## 9 IT-Software / Software Services Gurgaon 772 1909 77.02664
## 10 IT-Software / Software Services Pune 521 1228 73.85674
## # ... with 57 more rows, and 1 more variables: lat <dbl>
leaflet(individual_listing) %>% addTiles() %>%
addProviderTiles(providers$CartoDB.Positron)%>%
#overlay groups
# addAwesomeMarkers(~lon, ~lat, icon=hospital.icon, label=~industry,
# data =individual_listing[industry=="Medical / Healthcare / Hospitals"],
# clusterOptions = T, group ="Healthcare") %>%
# addAwesomeMarkers(~lon, ~lat, icon=software.icons, label=~as.character(industry),
# data =individual_listing[industry=="IT-Software / Software Services"],
# clusterOptions = T, group = "IT") %>%
# addAwesomeMarkers(~lon, ~lat, icon=phone.icon, label=~as.character(industry),
# data =individual_listing[industry=="BPO / Call Centre / ITES"],
# clusterOptions = T, group ="Call Centre") %>%
# addAwesomeMarkers(~lon, ~lat, icon=bank.icon, label=~as.character(industry),
# data =individual_listing[industry=="Strategy / Management Consulting Firms"],
# clusterOptions = T, group = "Strategy") %>%
# addAwesomeMarkers(~lon, ~lat, icon=recruit.icon, label=~as.character(industry),
# data =individual_listing[industry=="Recruitment / Staffing"],
# clusterOptions = T, group = "Recruit") %>%
  addCircles(~lon, ~lat, stroke = FALSE, group = "IT",
             data = individual_listing[industry == "IT-Software / Software Services"]) %>%
  # Layers control
  addLayersControl(
    overlayGroups = c("Medical", "IT", "Call Centre", "Strategy", "Recruit"),
    options = layersControlOptions(collapsed = FALSE)
  )
individual_listing[industry=="Medical / Healthcare / Hospitals"]
## # A tibble: 0 x 6
## # ... with 6 variables: industry <fctr>, m.i.location <chr>, N <int>,
## # i.N <int>, lon <dbl>, lat <dbl>
unique(individual_listing$industry)
## [1] IT-Software / Software Services
## [2] Recruitment / Staffing
## [3] Banking / Financial Services / Broking
## [4] Internet / Ecommerce
## [5] Pharma / Biotech / Clinical Research
## [6] BPO / Call Centre / ITES
## [7] Education / Teaching / Training
## [8] Automobile / Auto Anciliary / Auto Components
## [9] Strategy / Management Consulting Firms
## [10] Construction / Engineering / Cement / Metals
## 10 Levels: Automobile / Auto Anciliary / Auto Components ...
leaflet(individual_listing) %>% addTiles() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  # overlay groups: one clustered marker layer per industry
  addCircleMarkers(~lon, ~lat, popup = ~industry,
                   data = individual_listing[industry == "IT-Software / Software Services"],
                   clusterOptions = markerClusterOptions(), group = "IT") %>%
  addCircleMarkers(~lon, ~lat, popup = ~industry,
                   data = individual_listing[industry == "Recruitment / Staffing"],
                   clusterOptions = markerClusterOptions(), group = "Recruit") %>%
  addCircleMarkers(~lon, ~lat, popup = ~industry,
                   data = individual_listing[industry == "Banking / Financial Services / Broking"],
                   clusterOptions = markerClusterOptions(), group = "Bank") %>%
  addCircleMarkers(~lon, ~lat, popup = ~industry,
                   data = individual_listing[industry == "Internet / Ecommerce"],
                   clusterOptions = markerClusterOptions(), group = "Internet") %>%
  addCircleMarkers(~lon, ~lat, popup = ~industry,
                   data = individual_listing[industry == "Pharma / Biotech / Clinical Research"],
                   clusterOptions = markerClusterOptions(), group = "Pharma") %>%
  # Layers control
  addLayersControl(
    overlayGroups = c("IT", "Recruit", "Bank", "Internet", "Pharma"),
    options = layersControlOptions(collapsed = FALSE)
  )
# viridis palette keyed on industry; circle radius scales with listing count
pal <- colorFactor("viridis", domain = unique(individual_listing$industry))
leaflet(individual_listing) %>% addTiles() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircles(lng = ~lon, lat = ~lat, weight = 10,
             radius = ~N * 10,
             color = ~pal(industry),
             popup = ~m.i.location) %>%
  addLegend("bottomright", pal = pal, values = ~industry,
            title = "Industry",
            opacity = 1)
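Under the hood, `colorFactor()` is essentially a lookup from factor levels to palette colours. The idea can be sketched in base R; the `industries` vector and hand-picked hex values below are hypothetical, not what leaflet computes for the real data:

```r
# Hypothetical three-industry domain with hand-picked viridis-like colours
industries <- c("IT", "Pharma", "Bank")
palette    <- c("#440154", "#21908C", "#FDE725")

# Named lookup: each level maps to exactly one colour, mirroring the
# palette function that colorFactor() returns
pal_fun <- function(x) setNames(palette, industries)[x]

pal_fun("Pharma")
```

Calling the function with a vector of industries returns the matching colours element-wise, which is how leaflet colours each circle via `color = ~pal(industry)`.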